5  Statistics

(ns stats
  (:require [scicloj.noj.v1.datasets :as datasets]
            [scicloj.noj.v1.stats :as stats]))

5.1 Correlation matrices

The stats/calc-correlations-matrix function computes the correlation matrix of selected columns of a given dataset and returns the result as a dataset with one row per pair of columns.

(-> datasets/iris
    (stats/calc-correlations-matrix
     [:sepal-length :sepal-width :petal-length :petal-width]))

_unnamed [16 3]:

:col-1         :col-2               :corr
:sepal-length  :sepal-length   1.00000000
:sepal-length  :sepal-width   -0.11000000
:sepal-length  :petal-length   0.87000000
:sepal-length  :petal-width    0.81000000
:sepal-width   :sepal-length  -0.11000000
:sepal-width   :sepal-width    1.00000000
:sepal-width   :petal-length  -0.41999999
:sepal-width   :petal-width   -0.36000001
:petal-length  :sepal-length   0.87000000
:petal-length  :sepal-width   -0.41999999
:petal-length  :petal-length   1.00000000
:petal-length  :petal-width    0.95999998
:petal-width   :sepal-length   0.81000000
:petal-width   :sepal-width   -0.36000001
:petal-width   :petal-length   0.95999998
:petal-width   :petal-width    1.00000000
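
To make the shape of this output concrete, here is a minimal sketch in plain Clojure that builds the same kind of long-format result (one row per ordered pair of columns). It assumes the coefficient is Pearson's, which the text above does not state, and it is not the library's implementation. The names mean, pearson, correlation-rows, and toy are hypothetical, and the data is made up rather than taken from iris.

```clojure
(defn mean
  "Arithmetic mean of a numeric sequence, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn pearson
  "Pearson correlation coefficient of two equal-length numeric sequences."
  [xs ys]
  (let [mx  (mean xs)
        my  (mean ys)
        dxs (map #(- % mx) xs)
        dys (map #(- % my) ys)
        cov (reduce + (map * dxs dys))
        sx  (Math/sqrt (reduce + (map * dxs dxs)))
        sy  (Math/sqrt (reduce + (map * dys dys)))]
    (/ cov (* sx sy))))

(defn correlation-rows
  "Given a map of column name -> values, return one row per ordered pair
  of the selected column names."
  [columns col-names]
  (for [a col-names
        b col-names]
    {:col-1 a
     :col-2 b
     :corr  (pearson (columns a) (columns b))}))

;; Made-up toy columns (assumption: not the iris data)
(def toy
  {:x [1.0 2.0 3.0 4.0]
   :y [2.0 4.0 6.0 8.0]
   :z [4.0 3.0 2.0 1.0]})

;; Three columns give 3 x 3 = 9 rows, just as four columns give 16 above
(correlation-rows toy [:x :y :z])
```

Each pair appears twice, once in each order, which is why the table above is symmetric and has 1.0 wherever :col-1 equals :col-2.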

5.2 Multivariate regression

The stats/regression-model function fits a regression model (using scicloj.ml) and adds some relevant information such as the R^2 measure.

(-> datasets/iris
    (stats/regression-model
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/elastic-net})
    (dissoc :model-data))

{:feature-columns [:sepal-width :petal-length :petal-width],
 :target-columns [:sepal-length],
 :explained #function[malli.core/-instrument/fn--55033],
 :R2 0.8582120394597336,
 :id #uuid "6ddb6cd8-8438-46e5-be59-69b02ba90b57",
 :predictions #tech.v3.dataset.column<float64>[150]
:sepal-length
[5.022, 4.724, 4.775, 4.851, 5.081, 5.360, 4.911, 5.030, 4.664, 4.903, 5.209, 5.098, 4.775, 4.572, 5.184, 5.522, 5.089, 4.970, 5.352, 5.217...],
 :predict
 #function[scicloj.noj.v1.stats/regression-model/predict--59249],
 :options {:model-type :smile.regression/elastic-net}}

(-> datasets/iris
    (stats/regression-model
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/ordinary-least-square})
    (dissoc :model-data))

{:feature-columns [:sepal-width :petal-length :petal-width],
 :target-columns [:sepal-length],
 :explained #function[malli.core/-instrument/fn--55033],
 :R2 0.8586117200664085,
 :id #uuid "f0804379-8d3a-4854-92d2-16b6ad7c63d0",
 :predictions #tech.v3.dataset.column<float64>[150]
:sepal-length
[5.015, 4.690, 4.749, 4.826, 5.080, 5.377, 4.895, 5.021, 4.625, 4.882, 5.216, 5.092, 4.746, 4.533, 5.199, 5.561, 5.094, 4.960, 5.368, 5.226...],
 :predict
 #function[scicloj.noj.v1.stats/regression-model/predict--59249],
 :options {:model-type :smile.regression/ordinary-least-square}}

The stats/linear-regression-model convenience function specifically uses the :smile.regression/ordinary-least-square model type, so its result matches the previous example (note the identical R^2).

(-> datasets/iris
    (stats/linear-regression-model
     :sepal-length
     [:sepal-width :petal-length :petal-width])
    (dissoc :model-data))

{:feature-columns [:sepal-width :petal-length :petal-width],
 :target-columns [:sepal-length],
 :explained #function[malli.core/-instrument/fn--55033],
 :R2 0.8586117200664085,
 :id #uuid "86e63b86-98c1-4216-8807-1fa290ea03eb",
 :predictions #tech.v3.dataset.column<float64>[150]
:sepal-length
[5.015, 4.690, 4.749, 4.826, 5.080, 5.377, 4.895, 5.021, 4.625, 4.882, 5.216, 5.092, 4.746, 4.533, 5.199, 5.561, 5.094, 4.960, 5.368, 5.226...],
 :predict
 #function[scicloj.noj.v1.stats/regression-model/predict--59249],
 :options {:model-type :smile.regression/ordinary-least-square}}
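
For readers who want to see what a least-squares fit and an R^2 value amount to, here is a minimal sketch in plain Clojure for the single-feature case. It uses the common definition R^2 = 1 - SS_res / SS_tot; whether the library computes :R2 in exactly this way is an assumption, and the multi-feature and elastic-net cases are not covered. The names fit-line, r-squared, xs, and ys are hypothetical, and the data is made up.

```clojure
(defn mean
  "Arithmetic mean of a numeric sequence, as a double."
  [xs]
  (/ (reduce + xs) (double (count xs))))

(defn fit-line
  "Least-squares fit of y = intercept + slope * x for a single feature."
  [xs ys]
  (let [mx    (mean xs)
        my    (mean ys)
        sxy   (reduce + (map #(* (- %1 mx) (- %2 my)) xs ys))
        sxx   (reduce + (map #(let [d (- % mx)] (* d d)) xs))
        slope (/ sxy sxx)]
    {:slope     slope
     :intercept (- my (* slope mx))}))

(defn r-squared
  "Coefficient of determination: 1 - SS_res / SS_tot."
  [ys predictions]
  (let [my     (mean ys)
        ss-res (reduce + (map #(let [d (- %1 %2)] (* d d)) ys predictions))
        ss-tot (reduce + (map #(let [d (- % my)] (* d d)) ys))]
    (- 1.0 (/ ss-res ss-tot))))

;; Made-up data lying close to a straight line
(def xs [1.0 2.0 3.0 4.0 5.0])
(def ys [2.1 3.9 6.2 7.8 10.1])

(def model (fit-line xs ys))

;; Predictions on the training data, analogous to the :predictions entry above
(def predictions
  (mapv #(+ (:intercept model) (* (:slope model) %)) xs))

;; A small result map in the spirit of the one printed above
(def result
  {:predictions predictions
   :R2          (r-squared ys predictions)})
```

An R^2 close to 1 means the predictions account for most of the variance in the target, which is how the values around 0.86 in the outputs above can be read.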

5.3 Adding regression predictions to a dataset

The stats/add-predictions function models a target column using feature columns and adds a new column with the model predictions. In the example below, the new column is :sepal-length-prediction.

(-> datasets/iris
    (stats/add-predictions
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/ordinary-least-square}))

_unnamed [150 6]:

:sepal-length  :sepal-width  :petal-length  :petal-width  :species   :sepal-length-prediction
5.1            3.5           1.4            0.2           setosa     5.01541576
4.9            3.0           1.4            0.2           setosa     4.68999718
4.7            3.2           1.3            0.2           setosa     4.74925142
4.6            3.1           1.5            0.2           setosa     4.82599409
5.0            3.6           1.4            0.2           setosa     5.08049948
5.4            3.9           1.7            0.4           setosa     5.37719368
4.6            3.4           1.4            0.3           setosa     4.89468378
5.0            3.4           1.5            0.2           setosa     5.02124524
4.4            2.9           1.4            0.2           setosa     4.62491347
4.9            3.1           1.5            0.1           setosa     4.88164236
...            ...           ...            ...           ...        ...
6.9            3.1           5.4            2.1           virginica  6.53429168
6.7            3.1           5.6            2.4           virginica  6.50917327
6.9            3.1           5.1            2.3           virginica  6.21025556
5.8            2.7           5.1            1.9           virginica  6.17251376
6.8            3.2           5.9            2.3           virginica  6.84264484
6.7            3.3           5.7            2.5           virginica  6.65460564
6.7            3.0           5.2            2.3           virginica  6.21608504
6.3            2.5           5.0            1.9           virginica  5.97143313
6.5            3.0           5.2            2.0           virginica  6.38302984
6.2            3.4           5.4            2.3           virginica  6.61824630
5.9            3.0           5.1            1.8           virginica  6.42341317

stats/add-predictions also attaches the model’s information to the metadata of that new column.

(-> datasets/iris
    (stats/add-predictions
     :sepal-length
     [:sepal-width :petal-length :petal-width]
     {:model-type :smile.regression/ordinary-least-square})
    :sepal-length-prediction
    meta
    (update :model
            dissoc :model-data :predict :predictions))

{:name :sepal-length-prediction,
 :datatype :float64,
 :n-elems 150,
 :column-type :prediction,
 :model
 {:feature-columns [:sepal-width :petal-length :petal-width],
  :target-columns [:sepal-length],
  :explained #function[malli.core/-instrument/fn--55033],
  :R2 0.8586117200664085,
  :id #uuid "ee962d92-d52f-4114-ad61-9104b3739651",
  :options {:model-type :smile.regression/ordinary-least-square}}}
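
The mechanism at work here is ordinary Clojure metadata, which can be sketched without any dataset library. In the following minimal example, a plain vector stands in for the prediction column, and the metadata map is a made-up, much smaller imitation of the one printed above. The names predictions, r2, and trimmed, and all the values, are hypothetical.

```clojure
;; A plain vector standing in for a prediction column (made-up values),
;; carrying a made-up model summary as metadata
(def predictions
  (with-meta [1.0 2.0 3.0]
    {:column-type :prediction
     :model       {:feature-columns [:a :b]
                   :target-columns  [:y]
                   :R2              0.5}}))

;; Read a nested entry back out of the metadata
(def r2
  (-> predictions meta :model :R2))

;; Trim entries from the nested :model map before inspecting it,
;; in the same way the example above uses update and dissoc
(def trimmed
  (-> predictions
      meta
      (update :model dissoc :feature-columns :target-columns)))

;; Metadata does not take part in equality: the values compare as usual
(= predictions [1.0 2.0 3.0])
```

Because metadata travels with the value without changing it, the prediction column can be used like any other column while still carrying a record of the model that produced it.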

5.4 Histograms

The stats/histogram function computes the data needed to plot a histogram: the count and the left and right edges of each bin.

(-> (repeatedly 99 rand)
    (stats/histogram {:bin-count 5}))

_unnamed [5 3]:

:count  :left       :right
20      0.03675103  0.22704316
22      0.22704316  0.41733530
16      0.41733530  0.60762744
17      0.60762744  0.79791958
24      0.79791958  0.98821172
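
As an illustration of how such rows can be derived, here is a minimal sketch in plain Clojure that splits the range between the smallest and largest value into equal-width bins and counts the values in each. The binning rule is an assumption based on the output above and may differ from what the library does; the name histogram-rows is hypothetical.

```clojure
(defn histogram-rows
  "Equal-width bins between the min and max of xs; returns one map per bin.
  Assumes xs contains at least two distinct values."
  [xs bin-count]
  (let [lo     (double (apply min xs))
        hi     (double (apply max xs))
        width  (/ (- hi lo) bin-count)
        ;; Index of the bin holding x; the maximum is placed in the last bin
        bin-of (fn [x]
                 (min (dec bin-count)
                      (int (/ (- x lo) width))))
        counts (frequencies (map bin-of xs))]
    (for [i (range bin-count)]
      {:count (get counts i 0)
       :left  (+ lo (* i width))
       :right (+ lo (* (inc i) width))})))

;; Same kind of input as the example above: 99 random numbers, 5 bins
(histogram-rows (repeatedly 99 rand) 5)
```

The counts always sum to the number of input values, and each bin's :right edge is the next bin's :left edge (up to floating-point rounding), as in the table above.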
source: notebooks/stats.clj